Classification by Support Vector Machine


Introduction

Motivation, context, history, related topics ...

Synopsis

Terms

  • Support vector machine
  • Support vector
  • Margin
  • Boundary

Exposition

SVM Applied to Linearly Separable Data

Linearly separable means ...

  • 1-dimensional data (i.e., 1 variable) can be separated into non-overlapping classes by a point.
  • 2-dimensional data (i.e., 2 variables) can be separated into non-overlapping classes by a line.
  • 3-dimensional data (i.e., 3 variables) can be separated into non-overlapping classes by a plane.
  • 4+-dimensional data (i.e., 4+ variables) can be separated into non-overlapping classes by a hyper-plane.

Data

Consider this dataset and new observation.

data
x1 class
1 A
2 A
3 A
6 A
14 B
15 B
16 B
18 B
new
x1
8

Find Support Vectors and Boundary

  • Support vectors are those observations at the edges of the margin separating the classes.
  • The boundary is the middle of the margin separating the classes.
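For this 1-dimensional dataset, the support vectors and boundary can be computed directly. A minimal sketch (the values 6, 14, and 10 below follow from the data above):

```python
# Linearly separable 1-D example: class A = {1, 2, 3, 6}, class B = {14, 15, 16, 18}.
# The support vectors are the closest observations of each class;
# the boundary sits midway between them.
a = [1, 2, 3, 6]
b = [14, 15, 16, 18]

sv_a, sv_b = max(a), min(b)      # support vectors: edges of the margin
boundary = (sv_a + sv_b) / 2     # middle of the margin
margin = sv_b - sv_a             # width of the gap between the classes

print(sv_a, sv_b)   # 6 14
print(boundary)     # 10.0
print(margin)       # 8
```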

Score & Predict by Score Sign

Score observations based on their signed distances from the boundary.

To predict a new observation's class, find its score in the same way and make the prediction based on its score sign.
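A sketch of sign-based prediction, assuming the boundary at x1 = 10 from the dataset above and the (chosen) convention that negative scores fall on the A side:

```python
# Score = signed distance from the boundary; the sign of the score picks the class.
def score(x, boundary=10.0):
    return x - boundary

def predict(x):
    # negative score -> A side, positive score -> B side (a chosen convention)
    return "A" if score(x) < 0 else "B"

print(score(8))     # -2.0
print(predict(8))   # A
```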

Score & Predict by Probability

Alternatively, score observations in a different way, interpret the scores as probabilities, and predict based on probability.

(Various implementations of the support vector machine method employ various schemes to determine probabilities - and may even incorporate some randomization due to cross-validation.)

Note that the sigmoid function rescales any value into the range 0 to 1. The sigmoid function is also known as the logistic function.

sigmoid: $\large \mathrm{sigmoid}(x) = \frac{1}{1+e^{-x}}$

sigmoid range: $0 < sigmoid(x) < 1$, for any $x$
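The sigmoid is easy to implement and check directly; a minimal sketch:

```python
import math

def sigmoid(x):
    # the logistic function: squashes any real value into the range (0, 1)
    return 1 / (1 + math.exp(-x))

print(sigmoid(0))    # 0.5, the midpoint
print(sigmoid(6))    # close to 1 for large positive x
print(sigmoid(-6))   # close to 0 for large negative x
```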

First, score observations based on their distances from the boundary.

Second, re-scale the scores so that observations on the margin edges have score 1. This step puts scores on an intuitive scale: magnitude less than 1 means inside the margin, greater than 1 means outside it.

Third, further re-scale the scores to the range 0 to 1. Use the sigmoid function to do this. This step provides for scores to be interpreted as probabilities.

To predict a new observation's class, find its score in the same way, interpret the score as a probability, and make a prediction based on the probability.

Use the probability with a cutoff to make a prediction.

new
x1 new.score prob.A prob.B
8 0.6224593 0.6224593 0.3775407
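The three steps above can be sketched as follows; the boundary (10) and half-margin (4) come from the earlier dataset, and the result reproduces prob.A in the table:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

boundary = 10.0      # middle of the margin (support vectors at 6 and 14)
half_margin = 4.0    # distance from the boundary to each margin edge

x_new = 8
raw_score = boundary - x_new       # step 1: distance from the boundary (A side positive)
scaled = raw_score / half_margin   # step 2: margin edges map to score 1
prob_a = sigmoid(scaled)           # step 3: squash into the range (0, 1)
prob_b = 1 - prob_a

print(round(prob_a, 7))   # 0.6224593, matching the table
prediction = "A" if prob_a >= 0.5 else "B"
```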

SVM Applied to Not Linearly Separable Data: Penalties

Data

Note, these observations are not linearly separable.

data
x1 class
2 A
3 A
4 B
5 A
20 B
21 B
22 B
23 B
new
x1
12

Find Support Vectors and Boundary, Assign Penalties

  • Support vectors are those observations at the edges of the margin separating the classes and any observations on the wrong side of the boundary. The support vectors are determined so that the margin is maximized, but subject to a penalty to account for the offending observations. The penalty depends on the cost and the offending observations' distances from the correct edges. You choose the cost.
  • The boundary is the middle of the margin separating the classes.

Consider observation x1=5 and observation x1=20 as candidates for the support vectors. The gap would be 20 - 5 = 15, but observation x1=4 would be distance 16 on the wrong side of the x1=20 edge. If the cost were chosen as 0.1, then the penalty would be 0.1 × 16 = 1.6, and so the gap with penalty would be 15 - 1.6 = 13.4. Similarly, if the cost were chosen as 1 or 10, then the gap with penalty would be 15 - (1 × 16) = -1 or 15 - (10 × 16) = -145, respectively.

cost gap offense penalty gap_with_penalty
0.1 15 16 1.6 13.4
1.0 15 16 16.0 -1.0
10.0 15 16 160.0 -145.0

Consider observation x1=3 and observation x1=20 as candidates for the support vectors. The gap would be 20 - 3 = 17, but observation x1=4 would be distance 16 on the wrong side of the x1=20 edge, and observation x1=5 would be distance 2 on the wrong side of the x1=3 edge. If the cost were chosen as 0.1, then the penalty would be 0.1 × (16 + 2) = 1.8, and so the gap with penalty would be 17 - 1.8 = 15.2. Similarly, if the cost were chosen as 1 or 10, then the gap with penalty would be 17 - (1 × (16 + 2)) = -1 or 17 - (10 × (16 + 2)) = -163, respectively.

cost gap offense_2 offense_3 penalty gap_with_penalty
0.1 17 2 16 1.8 15.2
1.0 17 2 16 18.0 -1.0
10.0 17 2 16 180.0 -163.0

Note, the choice of cost influences the determination of the support vectors, which determine the boundary. In our example, at cost=0.1, observations x1=3 and x1=20 would be preferred over observations x1=5 and x1=20 as support vectors because 15.2 > 13.4. A new observation x1=12 would be classified by sign as B. At cost=10, the reverse would be preferred because -145 > -163, and a new observation x1=12 would be classified by sign as A.

The support vector machine method relies on an optimization algorithm to find the support vectors by maximizing the gap between margin edges net of penalties.
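The penalized-gap arithmetic in the tables above can be sketched as:

```python
# Gap between candidate support vectors, minus cost times the total
# distance of offending observations from the correct margin edge.
def gap_with_penalty(sv_low, sv_high, offenses, cost):
    gap = sv_high - sv_low
    penalty = cost * sum(offenses)
    return gap - penalty

# Candidates x1=5 and x1=20: x1=4 is distance 16 past the x1=20 edge.
for cost in (0.1, 1.0, 10.0):
    print(cost, round(gap_with_penalty(5, 20, [16], cost), 1))

# Candidates x1=3 and x1=20: offenses of distance 2 (x1=5) and 16 (x1=4).
for cost in (0.1, 1.0, 10.0):
    print(cost, round(gap_with_penalty(3, 20, [2, 16], cost), 1))
```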


SVM Applied to Not Linearly Separable Data: Kernel Trick

Data

data
x1 class
4 A
6 A
7 A
10 B
11 B
12 B
14 B
18 A
19 A
20 A
new
x1
8

Apply Kernel

Introduce a new variable by applying a kernel function to the original dataset. This effectively increases the dimensionality of the dataset. Here, we apply the kernel function $y = (x-12)^2$.

data
x1 y class
4 64 A
6 36 A
7 25 A
10 4 B
11 1 B
12 0 B
14 4 B
18 36 A
19 49 A
20 64 A
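A sketch of applying this kernel, reproducing the lifted table above:

```python
# Lift the 1-D data into 2-D by adding the variable y = (x - 12)**2.
data = [(4, "A"), (6, "A"), (7, "A"), (10, "B"), (11, "B"),
        (12, "B"), (14, "B"), (18, "A"), (19, "A"), (20, "A")]

def kernel(x):
    return (x - 12) ** 2

lifted = [(x, kernel(x), c) for x, c in data]
for x, y, c in lifted:
    print(x, y, c)

# The new observation x1=8 is lifted the same way before scoring.
print(kernel(8))   # 16
```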

Find Support Vectors and Boundary

  • Support vectors are those observations at the edge of the margin separating the classes, where the margin is maximized.
  • The boundary is the middle of the margin separating the classes (a line in 2-dimensional space).

Score & Predict by Score Sign

Score a new observation based on its distance from the boundary and predict the new observation's class based on its score sign.

Score & Predict by Probability

Alternatively, assign probabilities to observations based on their scores, and predict a new observation's class based on its probability.

SVM Applied to Very Not Linearly Separable Data: Kernel Trick

Data

data
x1 class
1 B
2 B
4 A
6 A
7 A
10 B
11 B
12 B
14 B
18 A
19 A
20 A
new
x1
8

Apply Kernel

Introduce a new variable by applying a kernel function to the original dataset. This effectively increases the dimensionality of the dataset. Here, we apply the kernel function $y = -1.8 + 3.0x -0.4x^2 +0.1x^3$.
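A sketch of applying this cubic kernel to the observations (the text does not show the lifted table for this dataset, so the values below are computed, not quoted):

```python
# Lift the 1-D data into 2-D with the cubic kernel from the text.
xs      = [1, 2, 4, 6, 7, 10, 11, 12, 14, 18, 19, 20]
classes = ["B", "B", "A", "A", "A", "B", "B", "B", "B", "A", "A", "A"]

def kernel(x):
    return -1.8 + 3.0 * x - 0.4 * x**2 + 0.1 * x**3

lifted = [(x, round(kernel(x), 1), c) for x, c in zip(xs, classes)]
for row in lifted:
    print(row)

# The new observation x1=8 is lifted the same way before scoring.
print(round(kernel(8), 1))
```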

Find Support Vectors and Boundary

  • Support vectors are those observations at the edge of the margin separating the classes, where the margin is maximized.
  • The boundary is the middle of the margin separating the classes (a line in 2-dimensional space).

Score & Predict by Score Sign

Score a new observation based on its distance from the boundary and predict the new observation's class based on its score sign.

Score & Predict by Probability

Alternatively, assign probabilities to observations based on their scores, and predict a new observation's class based on its probability.

SVM Applied to Not Linearly Separable Data with 2 Variables: Kernel Trick

Data

data
x1 x2 class
0 3 A
1 5 A
1 6 A
2 6 A
4 7 A
5 7 A
7 0 A
7 4 A
7 5 A
8 2 A
8 4 A
3 2 B
3 3 B
4 1 B
4 2 B
4 3 B
5 2 B
5 3 B
new
x1 x2
6 4.5

Apply Kernel

Introduce a new variable by applying a kernel function to the original dataset. This effectively increases the dimensionality of the dataset. Here, we apply the kernel function $z = (x1-4)^2 + (x2-2)^2$.
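A sketch of applying this kernel: after lifting, the B cluster (centered near (4, 2)) gets small z values while the A observations get large ones, so a plane can separate the classes in 3 dimensions:

```python
# Lift the 2-variable data into 3-D by adding z = (x1 - 4)**2 + (x2 - 2)**2.
data = [(0, 3, "A"), (1, 5, "A"), (1, 6, "A"), (2, 6, "A"), (4, 7, "A"),
        (5, 7, "A"), (7, 0, "A"), (7, 4, "A"), (7, 5, "A"), (8, 2, "A"),
        (8, 4, "A"), (3, 2, "B"), (3, 3, "B"), (4, 1, "B"), (4, 2, "B"),
        (4, 3, "B"), (5, 2, "B"), (5, 3, "B")]

def kernel(x1, x2):
    return (x1 - 4) ** 2 + (x2 - 2) ** 2

lifted = [(x1, x2, kernel(x1, x2), c) for x1, x2, c in data]

# After lifting, the classes separate cleanly along z:
max_b = max(z for _, _, z, c in lifted if c == "B")   # 2
min_a = min(z for _, _, z, c in lifted if c == "A")   # 13
print(max_b, min_a)

# The new observation (6, 4.5) is lifted the same way.
print(kernel(6, 4.5))   # 10.25
```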

Find Support Vectors and Boundary

  • Support vectors are those observations at the edge of the margin separating the classes, where the margin is maximized.
  • The boundary is the middle of the margin separating the classes (a plane in 3-dimensional space).

Hyper-Parameters for Support Vector Machine

  • kernel
    • linear
    • polynomial
      • degree: $x^2$, $x^3$, $x^4$, etc.
      • gamma: affects magnitude of kernel
      • coefficient: affects offset of kernel
    • radial basis (gaussian)
      • gamma = $\frac{1}{2\sigma^2}$: inversely proportional to the variance $\sigma^2$ of the gaussian
    • hyperbolic tangent
      • coefficient
    • other kernels
  • cost
  • scale: normalized or not normalized
  • tolerance: affects optimization
  • epsilon: affects penalty calculation
  • probability computation method
  • other hyper-parameters
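As one concrete example, scikit-learn's SVC exposes most of these hyper-parameters; the parameter names below are sklearn's own, and other implementations (e.g., e1071 in R) name them differently:

```python
# A minimal sketch using scikit-learn's SVC, fit on the first dataset above.
from sklearn.svm import SVC

model = SVC(
    kernel="linear",    # also: "poly", "rbf" (gaussian), "sigmoid" (tanh)
    C=1.0,              # cost of margin violations
    gamma="scale",      # kernel width; ignored for the linear kernel
    degree=3,           # polynomial degree; used only when kernel="poly"
    coef0=0.0,          # kernel offset; used by "poly" and "sigmoid"
    tol=1e-3,           # optimization stopping tolerance
    probability=True,   # fit Platt scaling (uses internal cross-validation)
)

X = [[1], [2], [3], [6], [14], [15], [16], [18]]
y = ["A", "A", "A", "A", "B", "B", "B", "B"]
model.fit(X, y)

print(model.predict([[8]]))        # sign-based prediction
print(model.predict_proba([[8]]))  # probability-based scores
```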

Copyright (c) Berkeley Data Analytics Group, LLC Document revised October 17, 2019